Metric Learning by Collapsing Classes
Abstract
We present an algorithm for learning a quadratic Gaussian metric (Mahalanobis distance) for use in classification tasks. Our method relies on the simple geometric intuition that a good metric is one under which points in the same class are simultaneously near each other and far from points in the other classes. We construct a convex optimization problem whose solution generates such a metric by trying to collapse all examples in the same class to a single point and push examples in other classes infinitely far away. We show that when the metric we learn is used in simple classifiers, it yields substantial improvements over standard alternatives on a variety of problems. We also discuss how the learned metric may be used to obtain a compact low dimensional feature representation of the original input space, allowing more efficient classification with very little reduction in performance.

1 Supervised Learning of Metrics

The problem of learning a distance measure (metric) over an input space is of fundamental importance in machine learning [10, 9], both supervised and unsupervised. When such measures are learned directly from the available data, they can be used to improve learning algorithms which rely on distance computations, such as nearest neighbour classification [5], supervised kernel machines (such as GPs or SVMs) and even unsupervised clustering algorithms [10]. Good similarity measures may also provide insight into the underlying structure of data (e.g. inter-protein distances), and may aid in building better data visualizations via embedding. In fact, there is a close link between distance learning and feature extraction, since whenever we construct a feature map $f(x)$ for an input space $\mathcal{X}$, we can measure distances between $x_1, x_2 \in \mathcal{X}$ using a simple distance function (e.g. Euclidean) $d[f(x_1), f(x_2)]$ in feature space. Thus, by fixing $d$, any feature extraction algorithm may be considered a metric learning method. Perhaps the simplest illustration of this approach is when $f(x)$ is a linear projection of $x \in \mathbb{R}^r$, so that $f(x) = Wx$. The squared Euclidean distance between $f(x_1)$ and $f(x_2)$ is then the Mahalanobis distance $\|f(x_1) - f(x_2)\|^2 = (x_1 - x_2)^T A (x_1 - x_2)$, where $A = W^T W$ is a positive semidefinite matrix. Much of the recent work on metric learning has indeed focused on learning Mahalanobis distances, i.e. learning the matrix $A$. This is also the goal of the current work.

A common approach to learning metrics is to assume some knowledge in the form of equivalence relations, i.e. which points should be close and which should be far (without specifying their exact distances). In the classification setting there is a natural equivalence relation, namely whether two points are in the same class or not. One of the classical statistical methods which uses this idea for the Mahalanobis distance is Fisher's Linear Discriminant Analysis (see e.g. [6]). Other, more recent methods [10, 9, 5] seek to minimize various separation criteria between the classes under the new metric. In this work, we present a novel approach to learning such a metric. Our approach, the Maximally Collapsing Metric Learning algorithm (MCML), relies on the simple geometric intuition that if all points in the same class could be mapped into a single location in feature space and all points in other classes mapped to other locations, this would result in an ideal approximation of our equivalence relation. Our algorithm approximates this scenario via a stochastic selection rule, as in Neighborhood Component Analysis (NCA) [5].
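As a quick sanity check of the projection view above, the following NumPy sketch (our own illustration, not code from the paper; the dimensions and matrices are arbitrary) verifies that the squared Euclidean distance between projected points $Wx$ coincides with the Mahalanobis distance under $A = W^T W$.

```python
import numpy as np

# Minimal sketch (illustrative, not from the paper): a linear map W induces
# the PSD metric A = W^T W, so the Mahalanobis distance under A equals the
# squared Euclidean distance between the projected points f(x) = W x.
rng = np.random.default_rng(0)
r, p = 5, 3                         # input dimension r, projection dimension p
W = rng.normal(size=(p, r))         # arbitrary projection matrix
A = W.T @ W                         # induced positive semidefinite metric

x1, x2 = rng.normal(size=r), rng.normal(size=r)
diff = x1 - x2

d_mahalanobis = diff @ A @ diff                 # (x1 - x2)^T A (x1 - x2)
d_projected = np.sum((W @ x1 - W @ x2) ** 2)    # ||W x1 - W x2||^2

assert np.isclose(d_mahalanobis, d_projected)
```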
However, unlike NCA, the optimization problem is convex, and thus our method is completely specified by our objective function. Different initialization and optimization techniques may affect the speed of obtaining the solution, but the final solution itself is unique. We also show that our method approximates the local covariance structure of the data, as opposed to Linear Discriminant Analysis methods, which use only the global covariance structure.

2 The Approach of Collapsing Classes

Given a set of $n$ labeled examples $(x_i, y_i)$, where $x_i \in \mathbb{R}^r$ and $y_i \in \{1, \ldots, k\}$, we seek a similarity measure between two points in $\mathcal{X}$ space. We focus on metrics of the Mahalanobis form

$$d(x_i, x_j \mid A) = d^A_{ij} = (x_i - x_j)^T A (x_i - x_j), \qquad (1)$$

where $A$ is a positive semidefinite (PSD) matrix. Intuitively, what we want from a good metric is that it makes elements of $\mathcal{X}$ in the same class look close, whereas those in different classes appear far. Our approach starts with the ideal case in which this is true in the most optimistic sense: same-class points are at zero distance, and different-class points are infinitely far apart. Alternatively, this can be viewed as mapping $x$ via a linear projection $Wx$ (with $A = W^T W$), such that all points in the same class are mapped into the same point. This intuition is related to the analysis of spectral clustering [8], where the ideal-case analysis of the algorithm results in all same-cluster points being mapped to a single point.

To learn a metric which approximates the ideal geometric setup described above, we introduce, for each training point, a conditional distribution over other points (as in [5]). Specifically, for each $x_i$ we define a conditional distribution over points $j \neq i$ such that

$$p^A(j \mid i) = \frac{1}{Z_i} e^{-d^A_{ij}} = \frac{e^{-d^A_{ij}}}{\sum_{k \neq i} e^{-d^A_{ik}}}, \qquad j \neq i. \qquad (2)$$

If all points in the same class were mapped to a single point and infinitely far from points in different classes, we would have the ideal "bi-level" distribution:

$$p_0(j \mid i) \propto \begin{cases} 1 & y_i = y_j \\ 0 & y_i \neq y_j. \end{cases} \qquad (3)$$

Furthermore, under very mild conditions, any set of points which achieves the above distribution must have the desired geometry. In particular, assume there are at least $\hat{r} + 2$ points in each class, where $\hat{r} = \operatorname{rank}[A]$ (note that $\hat{r} \leq r$). Then $p^A(j \mid i) = p_0(j \mid i)$ for all $i, j$ implies that under $A$ all points in the same class will be mapped to a single point, infinitely far from points of other classes. Proof sketch: the infinite separation between points of different classes follows simply from the fact that $p_0(j \mid i) = 0$ whenever $y_i \neq y_j$, which can only be matched if $d^A_{ij}$ grows without bound. For points within a class, matching the uniform distribution in (3) forces all within-class distances to be equal to a common value; since a metric of rank $\hat{r}$ admits at most $\hat{r} + 1$ mutually equidistant points at a nonzero distance, having at least $\hat{r} + 2$ points per class forces that common distance to be zero.

Thus it is natural to seek a matrix $A$ such that $p^A(j \mid i)$ is as close as possible to $p_0(j \mid i)$. Since we are trying to match distributions, we minimize the KL divergence $KL[p_0 \| p^A]$:

$$\min_A \; \sum_i KL\big[\, p_0(j \mid i) \,\big\|\, p^A(j \mid i) \,\big] \quad \text{s.t.} \quad A \in \mathrm{PSD}. \qquad (4)$$

The crucial property of this optimization problem is that it is convex in the matrix $A$. To see this, first note that any convex combination of feasible solutions, $A = \lambda A_0 + (1 - \lambda) A_1$ with $0 \leq \lambda \leq 1$, is still a feasible solution, since the set of PSD matrices is convex. Next, we can show that the cost at such a combination never exceeds the corresponding combination of the endpoint costs, i.e. that the objective itself is convex. To do this, we rewrite the objective function $f(A) = \sum_i KL\big[p_0(j \mid i) \,\|\, p^A(j \mid i)\big]$ in the form

$$f(A) = -\sum_{i, j : y_j = y_i} \log p^A(j \mid i) = \sum_{i, j : y_j = y_i} d^A_{ij} + \sum_i \log Z_i,$$

where we assumed for simplicity that classes are equi-probable, yielding a multiplicative constant. To see why $f(A)$ is convex, first note that $d^A_{ij} = (x_i - x_j)^T A (x_i - x_j)$ is linear in $A$, and thus convex. The function $\log Z_i$ is a log-sum-exp of affine functions of $A$ and is therefore also convex (see [4], page 74).
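To make the objective concrete, here is a minimal NumPy sketch (our own illustration, not the authors' implementation; the function name and toy data are hypothetical). It evaluates $f(A)$ in the simplified form above and numerically checks the convexity inequality $f(\lambda A_0 + (1-\lambda) A_1) \leq \lambda f(A_0) + (1-\lambda) f(A_1)$ for two arbitrary PSD metrics.

```python
import numpy as np

def mcml_objective(X, y, A):
    """MCML objective in the simplified form
    f(A) = sum_{i,j: y_j = y_i} d^A_ij + sum_i log Z_i  (cf. Eqs. 2-4)."""
    n = X.shape[0]
    diffs = X[:, None, :] - X[None, :, :]                 # x_i - x_j for all pairs
    d = np.einsum('ijk,kl,ijl->ij', diffs, A, diffs)      # d^A_ij
    np.fill_diagonal(d, np.inf)                           # exclude j == i
    log_Z = np.log(np.sum(np.exp(-d), axis=1))            # log Z_i
    same = (y[:, None] == y[None, :]) & ~np.eye(n, dtype=bool)
    return np.sum(d[same]) + np.sum(log_Z)

# Toy data and two arbitrary PSD metrics A_0 = W_0^T W_0, A_1 = W_1^T W_1.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
y = rng.integers(0, 3, size=20)
W0, W1 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
A0, A1 = W0.T @ W0, W1.T @ W1

# Convexity check: f(lam*A0 + (1-lam)*A1) <= lam*f(A0) + (1-lam)*f(A1).
lam = 0.3
f_mix = mcml_objective(X, y, lam * A0 + (1 - lam) * A1)
f_chord = lam * mcml_objective(X, y, A0) + (1 - lam) * mcml_objective(X, y, A1)
assert f_mix <= f_chord + 1e-9
```

Note that the sketch only evaluates the objective; actually minimizing (4) additionally requires enforcing the PSD constraint on $A$.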